Pangenomic analysis of Chinese gastric cancer

Pangenomic study might improve the completeness of human reference genome (GRCh38) and promote precision medicine. Here, we use an automated pipeline of human pangenomic analysis to build gastric cancer pan-genome for 185 paired deep sequencing data (370 samples), and characterize the gene presence-absence variations (PAVs) at whole genome level. Genes ACOT1, GSTM1, SIGLEC14 and UGT2B17 are identified as highly absent genes in gastric cancer population. A set of genes from unaligned sequences with GRCh38 are predicted. We successfully locate one of predicted genes GC0643 on chromosome 9q34.2. Overexpression of GC0643 significantly inhibits cell growth, cell migration and invasion, cell cycle progression, and induces cell apoptosis in cancer cells. The tumor suppressor functions can be reversed by shGC0643 knockdown. The GC0643 is approved by NCBI database (GenBank: MW194843.1). Collectively, the robust pan-genome strategy provides a deeper understanding of the gene PAVs in the human cancer genome.

Normal_AllReads: P = 5.2e-05). Note: ***, P ≤ 0.001; NS., not significant; Tumor_AllReads, de novo assembly was performed using all reads of tumor tissues; Tumor_HalfReads, de novo assembly was performed using half of reads from tumor tissues; Normal_AllReads, de novo assembly was performed using all reads of matched normal mucosae. Venn diagram where only homozygous deletion SVs were used in GRCh38-based SV method. Both in subfigures a and b, for each gene category, the gene number was the average value in all samples. c The presence/absence patterns of 212 genes identified as absence by GCPAN-based PAV method but presence by GRCh38-based SV method in at least one normal mucosae sample. The twelve distributed genes significantly associated with clinical phenotypes were marked to the right side of the heatmap. Note：(-) represents for gene absence, and (+) represents for gene presence.
d The 203 genes identified as absence by PAV method but presence by SV method in at least one primary tumor samples. The names of 12 genes associated with tumor phenotypes were listed on the right of the heatmap. e and f the ratios (e) and numbers (f) of SV insertion sequences aligned to novel sequences of GCPAN (orange) and SV insertion sequences aligned to the human reference genome GRCh38 (purple). Fig. 32 The upstream regulators predicted by IPA for genes related to GC0643 PAVs. This figure was generated by IPA (version 70750971) upstream regulation analysis. The 10 up-regulated and 138 down-regulated genes with fold change > 3 were input together. The 17 upstream regulators and the relationship with P-value less than 0.001 are presented. Supplementary Fig. 33 The positive control result targeting housekeeping gene PPIB by RNAScope. To examine the mRNA transcript of GC0643, the probe targeting housekeeping gene PPIB was used to examine the mRNA expression in gastric mucosa tissue. The positive signals present as red dots, indicating that RNA integrity of gastric mucosa is proper for mRNA detection, and the reagent quality is qualified (RNA in situ hybridization, 200×magnification). Fig. 34 The shRNA interfering sequences targeting GC0643 gene.

Supplementary
The sequences of shRNA2 and shRNA3 showed well efficacy.
Supplementary Fig. 35 The primers of GC0643 mRNA expression detection.
Housekeeping gene GAPDH was used and internal control. Fig. 36 The biological functions of GC0643 overexpression. a Effect of GC0643 mRNA transcription on NCI-N87 cancer cells by qRT-PCR. Data are presented as mean values ± SD from n = 3 biological replicates. Data were analyzed statistically by two-tailed Student's t test. *** P = 2.17E-05. b Increased protein level was observed after gene transfection. c Cell growth activity was suppressed after gene transfection based on CCK-8 assays. Data are presented as mean values ± SD. of 7 biological replicates. Data were analyzed statistically by two-tailed Student's t test with ** P = 0.008, *** P = 4.61E-05. d Representative images of EdU positive cells and nuclei stained with Hoechest from n = 5 biological replicates. Scale bar represents 20μm. Data are presented as mean values ± SD. of 7 biological replicates. Data were analyzed statistically by two-tailed Student's t test with ** P = 0.007. e Overexpression of GC0643 induced cell apoptosis. Data are presented as mean values ± SD. of 3 biological replicates. P values were derived from two-tailed Student's t test with *** P = 0.0001. f Overexpression of GC0643 changed cell cycle resulted in G2/M arrest with increased G2/M fraction. Data are presented as mean values ± SD. of 4 biological replicates. P values were derived from two-tailed Student's t test with ** P = 0.001. Source data are provided as a Source Data file.

Supplementary tables
Supplementary      Comparing the assembled results of matched normal mucosa, the assembled genome sizes of tumor tissue were significantly larger (Supplementary Fig. 2a), and the N50 sizes of tumor tissue were significantly larger (Supplementary Fig.2b). Because all female individuals are lack of Y chromosome, the assembled genome sizes of male individuals were larger than that of female individuals (Supplementary Fig.2c). There is no significant difference of N50 sizes between male and female individuals ( Supplementary Fig.2d).
We randomly selected sequencing datasets of ten tumor tissues and extracted the subset of reads, which the number of reads equals to that of their matched normal mucosae. The subset reads were conducted de novo assembly by SGA with the optimal parameters used in matched normal mucosa 5 . Then the assembled genome sizes from different data sets were compared. After eliminating the difference of sequencing depth, the assembled genome sizes were consistent with that of matched normal mucosa, suggested that the discrepancy of assembled genome size between tumor tissue and matched normal mucosa is due to the different sequencing depth and assembled parameter of SGA (Supplementary Fig.3).

Identification of non-reference sequences
In current study, fully unaligned sequences are defined as contigs with no alignment to the reference human genome using MUMMER with default parameters (-c 65 -l 20) while partially unaligned sequences are defined as contigs with at least one alignment and one unaligned fragment longer than 500 bps. All assembled contigs were compared with the human reference genome, resulted in a total of 949.  Fig. 4). The total length of fully unaligned (Supplementary Fig.5a) and partially unaligned (Supplementary Fig.5b) contigs in few genomes was obviously large due to potential contaminations. After removing the contaminated sequences from non-human species, there were ~4.80 Mbp fully unaligned contigs and ~5.67 Mbp partially unaligned contigs for each sample. There was no significant difference between females and males for both fully unaligned contigs (Supplementary Fig.5c) and partially unaligned contigs (Supplementary Fig.5d).
Helicobacter pylori (HP) and Epstein-Barr virus (EBV) are two well-known microorganisms related with gastric carcinogenesis 6 . To examine whether the non-human sequences were from HP and EBV, we aligned the fully unaligned contigs from each sample to the reference genome sequences of HP (GenBank accession: GCF_000008525.1) and EBV (GenBank accession: GCA_900474115.1) by nucmer implemented in MUMmer program 7 with the parameters "--maxmatch -l 65", respectively. Only those contigs that have any alignment with at least 95% sequence identity and 95% sequence coverage were considered as the true sequences from HP or EBV. In total, we discovered the sequences hitting of HP genome in 52 individuals (including 52 matched normal mucosae and 12 tumor tissues) (Supplementary Table 2) and EBV in 8 individuals (including 1 matched normal mucosa and 8 tumor tissues) (Supplementary Table 3). In addition, we could not identify any sequences in partially unaligned contigs aligned to the sequences of HP and EBV, suggested that the HP and EBV sequences did not integrate into human genome. For validation, we extracted the reads could not be mapped to the constructed pan-genome sequences for each sample and aligned them to reference genome sequences of two microorganisms by BWA MEM 8 with default parameters. Only these reads with CIGAR flag as "150M" in the alignment result files were counted. And the results were accordance with that of contig alignments (Supplementary Fig.6).

Annotation of the non-reference sequences
For non-reference sequences annotation, low-complexity repeats on non-reference sequences were firstly masked by RepeatMasker (http://www.repeatmasker.org). The masked sequences were used to conducted ab initio gene prediction by SNAP (version 2006-07-28) 9 and Augustus (version 2.5.5) 10 .
The "human" and "mammal" models were selected for Augustus and SNAP prediction, respectively. RNA-seq data of gastric tissues and public human ESTs perditions were combined and refined with RNA and protein evidence by EVidenceModeller 12 . Predicted genes with length shorter than 100 bp were firstly removed. All the remaining genes were clustered at a global identity of 80% using CDH-HIT and further remove potential redundancy at a global identity of 50% by aligned to reference genome and transcripts. Genes with incomplete models or overlapping with repeat sequences more than 50% were further removed. Finally, 14 full-length genes were predicted from the non-reference genome sequences (Supplementary Table 4). All predicted genes were functionally annotated by searching their protein sequences using the software InterProScan 13 (version 5.39-77.0). to extract non-reference sequences. All non-reference sequences were merged and redundant sequences were removed with a cutoff of sequence identity 90% using CD-HIT (version v4.6.7) 14 with parameters "-c 0.9 -T 16". This step was performed for the fully unaligned sequences and partially unaligned sequences separately. Then the non-redundant sequences were aligned to NT database (downloaded from NCBI, November 1st, 2019) by BLASTN with parameters "-evalue 1e-05 -outfmt 7

Construction and annotation of GCPAN
-max_target_seqs 1 -num_threads 16". Contigs, whose best alignments (alignment length ≥100 bp and sequence identity ≥ 60%) were not from primates, were considered as potential contaminations and were discarded for further analysis. We added the non-redundant non-reference sequences into GRCh38 to construct the sequences of GCPAN. The annotation of the human reference genome part was from GENCODE (version 30) 15 .

Closing breakpoints by partially unaligned contigs
We collected all the partially unaligned contigs from the genomes of tumor tissue and matched normal mucosa, respectively, and extracted coordinates of their aligned regions on the human reference genome according to the QUAST result. The unaligned regions (≥ 500bp) of the partially unaligned contigs were extracted. To ensure sequences reliability, we only extracted sequences that attached by ≥ 500 bases alignment length with ≥ 95% identity or ≥ 100 bases alignment length with ≥ 99% identity. If both ends of a sequence could be aligned to the same chromosome region (≤ 100 kb) with the same orientation, we called the sequence as a two-end placed sequence. And if only one end of a sequence could be aligned to the human reference genome, we called the sequence as a one-end placed sequence.
We clustered two-end-placed sequences from all samples based on their positions on the human reference genome using "BEDtools merge" 16 Table 5). The rest of 672 genes were overlapped in intron regions.
Despite the high base-pair accuracy and comprehensive representation of human reference genome, there are still hundreds of N-gaps regions accumulating more than 150 Mbp sequences in the last version GRCh38 17 . In order to examine that whether the gaps were closed by our assembled results, we used "bedtools intersect" 16 to determine whether the alignment coordinates of two-end-placed sequences could span the reference gaps. In total, 25 gaps of human reference genomes could be completely closed by the two-end-placed novel sequences, resulting in an increasing of 258.60 kb sequences to the reference genome. And majority of the closed gaps were detected in multiple individuals (Supplementary Table 6). In addition, by checking whether the positions of all one-placed sequences were located within 100 bp upstream or downstream of reference gaps, we found that 16 gaps could be extended in both sides, and 53 gaps could be extended in one side, resulting in an increasing of 328.97 kb sequences to non-reference genome region (Supplementary Table 7)

Comparing read mapping ratios using different reference genomes
We randomly selected 1,000 (10 3 ) to one million sequencing reads (10 6 ) from the sequencing data of an individual NA12878 from public database and mapped them by Bowtie2 (version 2.3.3.1) 18 with default parameter to GRCh38 primary assembly sequences. We found that the mapping ratio of one million reads was close enough to the overall mapping ratio using all reads (Supplementary Fig.7a). We then randomly sampled one million reads from each sample and mapped them to three reference sequences: GRCh38 primary assembly sequences, GRCh38 primary assembly sequences plus alternative loci and patch sequences (GRCh38_alt), and currently  Fig.7b).

Positioning predicted genes on chromosomes by the third generation sequencing data
We aligned sequences of 14 predicted genes to the third generation sequencing contigs. If the global identity of a predicted gene is greater than 80% of its protein coding regions, this gene is considered as a gene supported by the third generation sequencing data. As results, 13 of the 14 genes are confirmed (Supplementary Fig.8).  Fig.9).

Comparison of PAVs with SGDP and 90 Han datasets
Of the distributed genes, 195 genes (186 from GRCh38 and 9 predicted genes) were considered as distributed genes in both primary tumor tissues and matched gastric mucosae (Supplementary Table 9 and 10). Other 36 and 30 distributed genes were detected in matched gastric mucosa and primary tumor tissues, respectively (Supplementary Table 11 and 12). The percentage of distributed genes in predicted genes was more than 64%, while the percentage of distributed genes in GRCh38 was approximately 1% (Supplementary Fig.10). In addition, the majority of distributed genes on GRCh38 were sporadically missed in a small number of samples, whereas the distributed genes of non-reference sequences evenly spread across all samples ( Supplementary Fig.11). A total of 210 protein coding genes, including 199 GRCh38 genes and 11 predicted genes were outlined as distributed genes in SGDP dataset based on currently constructed GCPAN (Supplementary Fig.12). Of the 210 distributed genes, 108 genes (51.42%) were overlapped with our dataset, including 99 GRCh38 genes and 9 predicted genes (Supplementary Fig.13, 14).
We compared PAVs of healthy individuals of SGDP with PAVs of gastric cancer, and identified 78 distributed genes on GRCh38 with significant difference in two populations. The absent variation of ZNF718 is omitted in the consequent analysis due to the extremely short coding region of the representative transcript we selected.  Fig.15). In addition, 19 of them are core genes in the 90 Han Chinese genomes and the remaining 167 genes have absence in at least one individual. It meant that majority of gene PAVs found in gastric cancer population are supported by the 90 Han Chinese dataset.
For the distributed genes with significant difference in gastric cancer population and SGDP non-Asian population, we further compared absence frequencies between gastric cancer population and 90 Han Chinese dataset, and found two genes with higher absence frequency in gastric cancer population than that in 90 Han Chinese One was ACOT1, and another was PRAMEF14 (Supplementary Fig.16). In all populations, the gene absence frequencies of PRAMEF14 were lower than those for ACOT1. In addition, the predicted gene GC0643 also showed higher gene abasence frequencies in Asian populations than that in SGDP non-Asian population ( Supplementary Fig.17).
We also compared the PAVs of our dataset with dataset of non-Asian population of SGDP, and found that gene absence frequencies were significantly different in 85 distributed genes (Fisher's exact test, P < 0.05). Among the 85 differential genes, 82 genes were located on GRCh38 and 3 genes were predicted genes ( Supplementary   Fig.18a). We calculated the odds ratio (OR) of these 85 distributed genes and found that 75 genes (72 on GRCh38) showed OR > 1.5 (represented the absent frequencies are high in our group), while 10 genes (all are genes of GRCh38) with OR < 0.5 (represented the absent frequencies are low in our group) (Supplementary Table 14). It suggested that cancer individuals carried more absent variations of susceptible genes than that in non-Asian individuals from SGDP dataset (Supplementary Fig.18b).
We further compared the PAVs of Asian population of SGDP with our dataset and found that gene absence frequencies were significantly different in 71 distributed genes (Fisher's exact test, P < 0.05). Among the 71 differential genes, 67 genes were located on GRCh38 and 4 genes were predicted genes (Supplementary Fig.19). We calculated the odds ratio (OR) of those 71 distributed genes and found that 69 genes (65 on GRCh38) showed OR > 1.5 (represented the absent frequencies are high in our group), while two genes (all are genes of GRCh38) with OR < 0.5 (represented the absent frequencies are low in our group) (Supplementary Table 15).
We further compared the PAVs of East Asian population of SGDP with our dataset and found that gene absence frequencies were significantly different in 49 distributed genes (Fisher's exact test, P < 0.05). Among the 49 differential genes, 47 genes were located on GRCh38 and two genes were predicted genes ( Supplementary   Fig.20). We calculated the odds ratio (OR) of those 49 distributed genes and found that 48 genes (46 on GRCh38) showed OR > 1.5 (represented the absent frequencies are high in our group), while one gene (all are genes of GRCh38) with OR < 0.5 (represented the absent frequencies are low in our group).

Correlation analysis of distributed genes with clinicopathological phenotypes
For distributed genes on GRCh38, we conducted functional analysis by  Fig.22). We further analyzed the association of clinical phenotypes of gastric cancer with all distributed genes (Supplementary Fig.23) and found that absence of HLA-DRB1 and HLA-DRB5 were mainly identified in ulcerative or infiltrative ulcerative types of Borrmann classification ( Supplementary   Fig.24). Histologically, most of cancers revealed moderate-to poor-differentiated adenocarcinoma. The absence of PSG9 and PSG4 was closely related to increased incidence of gastric cancer in corpus and cardia of stomach (Supplementary Fig.23 and 25). The larger tumor diameters, the higher frequencies of PIM3 absence (Supplementary Fig.23 and 26). Gene structure analysis showed that the absence of PIM3 occurred in exon 5 and exon 6 region. In addition, the expression level of PIM3 was significantly lower for PIM3 absence (P < 0.001, Supplementary Fig.27).
Furthermore, we divided absence variation of distributed genes into absence and presence groups (80% CDS coverage as cut-off) and evaluated the influence on patients' outcome. Several gene absences disclosed significant prognostic correlation, including AZU1 (Supplementary Fig.28), PRRX2 (Supplementary Fig.29), and SIGLEC14 (Supplementary Fig.30). We noticed that the absence of SIGLEC14 significantly influenced the prognosis of female patients, but not male patients. Most of these cancers occurred in the cardia and corpus of the stomach with ulcerative or infiltrative ulcerative classification. Histologically, cancers with SIGLEC14 absence showed overwhelming poorly-differentiated adenocarcinoma (73.91%) and compatible with intestinal-type cancer in Lauren classification, suggesting that the absence of SIGLEC14 in women may play important role on carcinogenesis of non-signet-ring cell carcinoma.

Comparison of the relationship of PAVs and SVs on GRCh38
In SV analysis, over 2700 genes were identified as absence ( Supplementary   Fig.31a), and only 9 genes were homozygous DEL-SVs (Supplementary Fig.31b). If only homozygous gene PAVs and DEL-SVs were compared, most genes were identified as presence congruously by SV methods and PAV method. There are 203 genes identified as absence by PAV method in at least one tumor samples (Supplementary Table 17). Of these 203 genes, 33 genes were absent in at least 54 (30%) samples (Supplementary Fig. 31c and 31d). Among these 33 genes, twelve genes were found significantly associated with clinical phenotypes. We extracted insertion SVs (INS-SVs) to compare with unaligned sequences in GCPAN analysis.
The insertion sequences were firstly masked by TRF (version 4.09.1) 23  with an average length of 76 bps were mapped to GRCh38 (Supplementary Fig. 31e and 31f).

Upstream regulation analysis of GC0643-PAV-associated genes
In RNA-Seq validation study (n = 65), we divided the data into gene absence group (n = 18) and gene presence group (n = 47) based the 80% coverage of CDS 5. At gene absence group, up-regulated expression was found in 10 genes (3 fold-change, P < 0.001), while down-regulated expression was detected in 138 genes (Supplementary   Table 18). We further studied the function of GC0643 by analyzing the upstream regulators of these 10 up-regulated and 138 down-regulated genes. IPA (version 70750971, QIAGEN, 2021) upstream regulation analysis was applied ( Supplementary   Fig.32, Supplementary Table 19).

GC0643 examination and biological functions in vitro
Human gastric cancer cells HGC27 and NCI-N87 were used. For knockdown of GC0643, shRNAs were used (Supplementary Fig.34). Non-targeting control shRNA was used as negative control. Stably transfected cells were then validated by mRNA ( Supplementary Fig.35) and protein expression analysis. We further verified the functions of GC0643 in NCI-N87 cell line(this cell line was from Western patient).
Enforced GC0643 expression on NCI-N87 significantly increased the mRNA and protein expression levels by RT-PCR and Western blot (Supplementary Fig. 36a, b).